coulomb matrix
Efficient interpolation of molecular properties across chemical compound space with low-dimensional descriptors
We demonstrate accurate data-starved models of molecular properties for interpolation in chemical compound spaces with low-dimensional descriptors. Our starting point is based on three-dimensional, universal, physical descriptors derived from the properties of the distributions of the eigenvalues of Coulomb matrices. To account for the shape and composition of molecules, we combine these descriptors with six-dimensional features informed by the Gershgorin circle theorem. We use the nine-dimensional descriptors thus obtained for Gaussian process regression based on kernels with variable functional form, leading to extremely efficient, low-dimensional interpolation models. The resulting models trained with 100 molecules are able to predict the product of entropy and temperature ($S \times T$) and zero point vibrational energy (ZPVE) with the absolute error under 1 kcal mol$^{-1}$ for $> 78$ \% and under 1.3 kcal mol$^{-1}$ for $> 92$ \% of molecules in the test data. The test data comprises 20,000 molecules with complexity varying from three atoms to 29 atoms and the ranges of $S \times T$ and ZPVE covering 36 kcal mol$^{-1}$ and 161 kcal mol$^{-1}$, respectively. We also illustrate that the descriptors based on the Gershgorin circle theorem yield more accurate models of molecular entropy than those based on graph neural networks that explicitly account for the atomic connectivity of molecules.
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
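The Coulomb matrix and the Gershgorin circle theorem named in the abstract above can be illustrated with a minimal Python sketch. It assumes the commonly used Coulomb matrix definition (0.5 Z_i^2.4 on the diagonal, Z_i Z_j / |R_i - R_j| off the diagonal, in atomic units) and computes one Gershgorin disc per row. The function names and the approximate water geometry are illustrative assumptions, and the paper's actual six-dimensional features are not reproduced here.

```python
import math


def coulomb_matrix(charges, positions):
    """Build the Coulomb matrix from nuclear charges and positions (bohr)."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: 0.5 * Z_i ** 2.4
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Z_i * Z_j / |R_i - R_j|
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


def gershgorin_discs(matrix):
    """Return one (center, radius) pair per row.

    By the Gershgorin circle theorem, every eigenvalue of the matrix
    lies in at least one of these discs.
    """
    discs = []
    for i, row in enumerate(matrix):
        radius = sum(abs(value) for j, value in enumerate(row) if j != i)
        discs.append((row[i], radius))
    return discs


# Approximate water geometry in bohr (illustrative values only).
charges = [8, 1, 1]
positions = [(0.0, 0.0, 0.0), (1.43, 1.11, 0.0), (-1.43, 1.11, 0.0)]

cm = coulomb_matrix(charges, positions)
discs = gershgorin_discs(cm)
for center, radius in discs:
    print(f"center={center:.3f} radius={radius:.3f}")
```

The disc centers depend only on the element types, while the radii also depend on the geometry, which is one plausible reason such quantities carry information about shape and composition.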
Mol-PECO: a deep learning model to predict human olfactory perception from molecular structures
Zhang, Mengji, Hiki, Yusuke, Funahashi, Akira, Kobayashi, Tetsuya J.
While visual and auditory information conveyed by wavelength of light and frequency of sound have been decoded, predicting olfactory information encoded by the combination of odorants remains challenging due to the unknown and potentially discontinuous perceptual space of smells and odorants. Herein, we develop a deep learning model called Mol-PECO (Molecular Representation by Positional Encoding of Coulomb Matrix) to predict olfactory perception from molecular structures. Mol-PECO updates the learned atom embedding by directional graph convolutional networks (GCN), which model the Laplacian eigenfunctions as positional encoding, and Coulomb matrix, which encodes atomic coordinates and charges. With a comprehensive dataset of 8,503 molecules, Mol-PECO directly achieves an area-under-the-receiver-operating-characteristic (AUROC) of 0.813 in 118 odor descriptors, superior to the machine learning of molecular fingerprints (AUROC of 0.761) and GCN of adjacency matrix (AUROC of 0.678). The learned embeddings by Mol-PECO also capture a meaningful odor space with global clustering of descriptors and local retrieval of similar odorants. Our work may promote the understanding and decoding of the olfactory sense and mechanisms.
- North America > United States (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
- Asia > Japan > Honshū > Kantō > Kanagawa Prefecture > Yokohama (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
Shape is (almost) all!: Persistent homology features (PHFs) are an information rich input for efficient molecular machine learning
3-D shape is important to chemistry, but how important? Machine learning works best when the inputs are simple and match the problem well. Chemistry datasets tend to be very small compared to those generally used in machine learning, so we need to get the most from each datapoint. Persistent homology measures the topological shape properties of point clouds at different scales and is used in topological data analysis. Here we investigate what persistent homology captures about molecular structure and create persistent homology features (PHFs) that encode a molecule's shape whilst losing most of the symbolic detail such as atom labels, valence, charge, and bonds. We demonstrate the usefulness of PHFs on a series of chemical datasets: QM7, lipophilicity, Delaney and Tox21. PHFs work as well as the best benchmarks. PHFs are very information dense and much smaller than other encodings found so far, so ML algorithms that use them are much more energy efficient. The success of PHFs despite losing a large amount of chemical detail highlights how much of chemistry can be simplified to topological shape.
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.46)
- Materials > Chemicals (0.69)
- Health & Medicine > Therapeutic Area > Oncology (0.40)
A photonic chip-based machine learning approach for the prediction of molecular properties
Zhang, Hui, Lau, Jonathan Wei Zhong, Wan, Lingxiao, Shi, Liang, Cai, Hong, Luo, Xianshu, Lo, Patrick, Lee, Chee-Kong, Kwek, Leong-Chuan, Liu, Ai Qun
Machine learning methods have revolutionized the discovery process of new molecules and materials. However, the intensive training process of neural networks for molecules with ever-increasing complexity has resulted in exponential growth in computation cost, leading to long simulation time and high energy consumption. Photonic chip technology offers an alternative platform for implementing neural networks with faster data processing and lower energy usage compared to digital computers. Photonics technology is naturally capable of implementing complex-valued neural networks at no additional hardware cost. Here, we demonstrate the capability of photonic neural networks for predicting the quantum mechanical properties of molecules. To the best of our knowledge, this work is the first to harness photonic technology for machine learning applications in computational chemistry and molecular sciences, such as drug discovery and materials design. We further show that multiple properties can be learned simultaneously in a photonic chip via a multi-task regression learning algorithm, which is also the first of its kind, as most previous works focus on implementing networks for classification tasks.
- North America > United States > California > Merced County > Merced (0.14)
- Asia > Singapore > Central Region > Singapore (0.04)
- North America > United States > Michigan (0.04)
- Materials (0.66)
- Energy (0.66)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.48)
DScribe: Library of Descriptors for Machine Learning in Materials Science
Himanen, Lauri, Jäger, Marc O. J., Morooka, Eiaki V., Canova, Filippo Federici, Ranawat, Yashasvi S., Gao, David Z., Rinke, Patrick, Foster, Adam S.
DScribe is a software package for machine learning that provides popular feature transformations ("descriptors") for atomistic materials simulations. DScribe accelerates the application of machine learning for atomistic property prediction by providing user-friendly, off-the-shelf descriptor implementations. The package currently contains implementations for Coulomb matrix, Ewald sum matrix, sine matrix, Many-body Tensor Representation (MBTR), Atom-centered Symmetry Function (ACSF) and Smooth Overlap of Atomic Positions (SOAP). Usage of the package is illustrated for two different applications: formation energy prediction for solids and ionic charge prediction for atoms in organic molecules. The package is freely available under the open-source Apache License 2.0.
- Europe > United Kingdom (0.14)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- North America > United States (0.04)
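A descriptor of the Coulomb matrix kind has to produce a fixed-length vector that does not depend on how the atoms happen to be numbered. The following Python sketch shows one common way to do that: sort rows and columns by descending L2 row norm, zero-pad to a maximum size, and flatten. It is a hand-written illustration, not DScribe's API; the name `sort_and_pad`, the parameter `n_max`, and the toy matrix values are all assumptions made for this example.

```python
import math


def sort_and_pad(matrix, n_max):
    """Sort by descending L2 row norm, zero-pad to n_max x n_max, flatten."""
    n = len(matrix)
    if n > n_max:
        raise ValueError("matrix is larger than n_max")
    norms = [math.sqrt(sum(value * value for value in row)) for row in matrix]
    # Atom indices ordered from largest to smallest row norm.
    order = sorted(range(n), key=lambda i: -norms[i])
    padded = [[0.0] * n_max for _ in range(n_max)]
    for a, i in enumerate(order):
        for b, j in enumerate(order):
            # Apply the same permutation to rows and columns.
            padded[a][b] = matrix[i][j]
    return [value for row in padded for value in row]


# Toy symmetric matrix standing in for a Coulomb matrix (illustrative values).
toy = [
    [0.5, 1.0, 0.2],
    [1.0, 36.9, 3.0],
    [0.2, 3.0, 0.5],
]

# The same toy system with its atoms relabelled in the order (2, 0, 1).
relabelled = [[toy[i][j] for j in (2, 0, 1)] for i in (2, 0, 1)]

vector_a = sort_and_pad(toy, n_max=4)
vector_b = sort_and_pad(relabelled, n_max=4)
print(vector_a == vector_b)
```

Sorting makes the descriptor independent of atom ordering as long as the row norms are distinct, and padding lets molecules of different sizes share one input dimension.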
Constant Size Molecular Descriptors For Use With Machine Learning
Collins, Christopher R., Gordon, Geoffrey J., von Lilienfeld, O. Anatole, Yaron, David J.
A set of molecular descriptors whose length is independent of molecular size is developed for machine learning models that target thermodynamic and electronic properties of molecules. These features are evaluated by monitoring performance of kernel ridge regression models on well-studied data sets of small organic molecules. The features include connectivity counts, which require only the bonding pattern of the molecule, and encoded distances, which summarize distances between both bonded and non-bonded atoms and so require the full molecular geometry. In addition to having constant size, these features summarize information regarding the local environment of atoms and bonds, such that models can take advantage of similarities resulting from the presence of similar chemical fragments across molecules. Combining these two types of features leads to models whose performance is comparable to or better than the current state of the art. The features introduced here have the advantage of leading to models that may be trained on smaller molecules and then used successfully on larger molecules.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
Localized Coulomb Descriptors for the Gaussian Approximation Potential
Barker, James, Bulin, Johannes, Hamaekers, Jan, Mathias, Sonja
We introduce a novel class of localized atomic environment representations, based upon the Coulomb matrix. By combining these functions with the Gaussian approximation potential approach, we present LC-GAP, a new system for generating atomic potentials through machine learning (ML). Tests on the QM7, QM7b and GDB9 biomolecular datasets demonstrate that potentials created with LC-GAP can successfully predict atomization energies for molecules larger than those used for training to chemical accuracy, and can (in the case of QM7b) also be used to predict a range of other atomic properties with accuracy in line with the recent literature. As the best-performing representation has only linear dimensionality in the number of atoms in a local atomic environment, this represents an improvement both in prediction accuracy and computational cost when considered against similar Coulomb matrix-based methods.
- North America > United States (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Germany (0.04)
Learning Invariant Representations of Molecules for Atomization Energy Prediction
Montavon, Grégoire, Hansen, Katja, Fazli, Siamac, Rupp, Matthias, Biegler, Franziska, Ziehe, Andreas, Tkatchenko, Alexandre, Lilienfeld, Anatole V., Müller, Klaus-Robert
The accurate prediction of molecular energetics in chemical compound space is a crucial ingredient for rational compound design. The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique and difficult machine learning problem. In this paper, we adopt a learning-from-scratch approach where quantum-mechanical molecular energies are predicted directly from the raw molecular geometry. The study suggests a benefit from setting flexible priors and enforcing invariance stochastically rather than structurally. Our results improve the state-of-the-art by a factor of almost three, bringing statistical methods one step closer to the holy grail of "chemical accuracy".
- North America > United States > New York (0.04)
- North America > United States > Illinois > Cook County > Lemont (0.04)
- North America > Canada (0.04)
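One way to read "enforcing invariance stochastically" is as data augmentation over atom orderings: instead of fixing a single sorted order, several plausible orderings of the same matrix are sampled. The Python sketch below perturbs the row norms with Gaussian noise before sorting, so atoms with similar norms occasionally swap places. It is a minimal sketch under that assumption; the function name `random_sorted`, the noise level, and the toy matrix are illustrative and are not taken from the paper.

```python
import math
import random


def random_sorted(matrix, noise_std, rng):
    """Permute rows and columns by sorting noise-perturbed L2 row norms."""
    norms = [math.sqrt(sum(value * value for value in row)) for row in matrix]
    noisy = [norm + rng.gauss(0.0, noise_std) for norm in norms]
    # Descending order of the perturbed norms.
    order = sorted(range(len(matrix)), key=lambda i: -noisy[i])
    return [[matrix[i][j] for j in order] for i in order]


# Toy symmetric matrix standing in for a Coulomb matrix (illustrative values).
toy = [
    [0.5, 1.0, 0.2],
    [1.0, 36.9, 3.0],
    [0.2, 3.0, 0.5],
]

rng = random.Random(0)  # fixed seed so the sampled orderings are repeatable
samples = [random_sorted(toy, noise_std=1.0, rng=rng) for _ in range(5)]
for sample in samples:
    print([row[0] for row in sample])
```

Each sample represents the same system, so a model trained on many such samples is encouraged, rather than forced, to give the same prediction for every ordering.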
Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning
Rupp, Matthias, Tkatchenko, Alexandre, Müller, Klaus-Robert, von Lilienfeld, O. Anatole
Cross-validation on 7165 molecules yields a mean absolute error of 9.9 kcal/mol, which is an order of magnitude more accurate than counting bonds or semiempirical quantum chemistry. We use the GDB data base, a library of nearly one billion organic molecules that are stable and synthetically accessible according to organic chemistry rules [15]. While potentially applicable to any stoichiometry, as a proof of principle we restrict ourselves to small organic molecules. Specifically, we define a controlled test-bed consisting of all 7165 organic molecules from the GDB data base with up to seven "heavy" atoms that contain C, N, O, or S, being saturated with hydrogen atoms. Atomization energies range from -800 to -2000 kcal/mol.
- North America > United States > California > Los Angeles County > Los Angeles (0.29)
- Europe > Germany > Berlin (0.05)
- North America > United States > New York (0.04)